# Multimodal QA

## Vilt Gqa Ft

A vision-language model based on the ViLT architecture, fine-tuned specifically for GQA visual reasoning tasks.

Author: phucd · Tags: Text-to-Image, Transformers · Downloads: 62 · Likes: 0
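ViLT checkpoints are served through dedicated classes in transformers. A minimal sketch of answering a question about an image, assuming this fine-tune keeps ViLT's standard question-answering head and that the Hub repo id matches the listing name (both are assumptions):

```python
# Minimal sketch: closed-vocabulary VQA with a ViLT checkpoint via transformers.
# ViltProcessor / ViltForQuestionAnswering are the documented ViLT classes;
# the repo id below is inferred from the listing and may differ on the Hub.
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

model_id = "phucd/vilt-gqa-ft"  # assumed repo id
processor = ViltProcessor.from_pretrained(model_id)
model = ViltForQuestionAnswering.from_pretrained(model_id)

image = Image.open("scene.jpg")  # placeholder image
encoding = processor(image, "How many people are in the picture?", return_tensors="pt")

outputs = model(**encoding)
answer_id = outputs.logits.argmax(-1).item()  # highest-scoring answer class
print(model.config.id2label[answer_id])
```

ViLT treats VQA as classification over a fixed answer vocabulary, which is why the answer comes from `id2label` rather than free-form generation.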
## VL Rethinker 7B 6bit

A multimodal model based on Qwen2.5-VL-7B-Instruct that supports visual question answering, converted to MLX format (6-bit quantization) for efficient operation on Apple silicon.

License: Apache-2.0 · Author: mlx-community · Tags: Text-to-Image, Transformers, English · Downloads: 19 · Likes: 0
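Conversions published by mlx-community are typically run with the mlx-vlm package. A minimal sketch, assuming `pip install mlx-vlm` and a repo id matching the listing name; the `generate` signature follows the mlx-vlm README and can shift between versions:

```python
# Minimal sketch: running the 6-bit MLX conversion on Apple silicon.
# Requires the mlx-vlm package; the repo id is inferred from the listing
# and may differ on the Hub.
from mlx_vlm import load, generate

model, processor = load("mlx-community/VL-Rethinker-7B-6bit")  # assumed repo id
output = generate(
    model,
    processor,
    prompt="What is happening in this picture?",
    image="photo.jpg",  # placeholder image path
    max_tokens=128,
)
print(output)
```

The 8-bit conversion listed next loads the same way; only the repo id (and the memory/quality trade-off) changes.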
## VL Rethinker 7B 8bit

VL-Rethinker-7B-8bit is a multimodal model based on Qwen2.5-VL-7B-Instruct, supporting visual question answering tasks.

License: Apache-2.0 · Author: mlx-community · Tags: Text-to-Image, Transformers, English · Downloads: 21 · Likes: 0
## Tinyllava Video Qwen2.5 3B Group 16 512

TinyLLaVA-Video is a video understanding model based on Qwen2.5-3B and siglip-so400m-patch14-384, using a grouped resampler to process video frames.

License: Apache-2.0 · Author: Zhang199 · Tags: Video-to-Text · Downloads: 76 · Likes: 0
## Videochat Flash Qwen2 7B Res224

A multimodal model built on UMT-L and Qwen2-7B, supporting long video understanding with only 16 tokens per frame and an extended context window of 128k.

License: Apache-2.0 · Author: OpenGVLab · Tags: Video-to-Text, Transformers, English · Downloads: 80 · Likes: 6
## Lava Phi

A vision-language model based on Microsoft's Phi-1.5 architecture, combined with CLIP for image processing capabilities.

License: MIT · Author: sagar007 · Tags: Image-to-Text, Transformers, Multilingual · Downloads: 17 · Likes: 0
## Idefics3 8B Llama3

Idefics3 is an open-source multimodal model capable of processing arbitrary sequences of image and text inputs to generate text outputs. It shows significant improvements in OCR, document understanding, and visual reasoning.

License: Apache-2.0 · Author: HuggingFaceM4 · Tags: Image-to-Text, Transformers, English · Downloads: 45.86k · Likes: 277
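Both Idefics entries expose the standard transformers vision-to-sequence interface. A minimal sketch of single-image question answering, following the AutoModelForVision2Seq chat-template pattern documented for the Idefics family; the image path and question are placeholders:

```python
# Minimal sketch: image + text question answering with Idefics3 via transformers.
# Follows the AutoModelForVision2Seq pattern from the Idefics model cards;
# the input image is a placeholder.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceM4/Idefics3-8B-Llama3"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("invoice.png")  # placeholder document image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is the total amount on this invoice?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

Idefics2 below is loaded the same way; only the repo id changes.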
## Idefics2 8b

Idefics2 is an open-source multimodal model capable of accepting arbitrary sequences of image and text inputs to generate text outputs. It shows significant improvements in OCR, document understanding, and visual reasoning.

License: Apache-2.0 · Author: HuggingFaceM4 · Tags: Image-to-Text, Transformers, English · Downloads: 14.99k · Likes: 603
## Llava Phi2

Llava-Phi2 is a multimodal implementation based on Phi-2, combining vision and language processing capabilities and suited to image-text-to-text tasks.

License: MIT · Author: RaviNaik · Tags: Image-to-Text, Transformers, English · Downloads: 153 · Likes: 6
## Video Blip Opt 2.7b Ego4d

VideoBLIP is an enhanced version of BLIP-2 capable of processing video data, using OPT-2.7b as the language model backbone.

License: MIT · Author: kpyu · Tags: Video-to-Text, Transformers, English · Downloads: 429 · Likes: 16